Goal

In general, the aim is to develop methods that tell how well an exercise is performed. In this case, the goal was to predict which variant of the Unilateral Dumbbell Biceps Curl was performed by the study participants.

The data come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants performing barbell lifts correctly and incorrectly in 5 different ways (one was the correct form; four were not). More info: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Specific tasks:

Describe:

  1. how the model was built

  2. how cross-validation was used

  3. the expected out-of-sample error

  4. why the choices were made

  5. the predictions for 20 different test cases.

Model building

#get training and test data
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFile <- basename(trainURL)
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testFile <- basename(testURL)

if(!file.exists(trainFile)) download.file(trainURL, destfile = trainFile)
trainSet <- read.csv(trainFile)
if(!file.exists(testFile)) download.file(testURL, destfile = testFile)
testSet  <- read.csv(testFile)
#not run to save space
head(trainSet)
summary(trainSet)
dim(trainSet)

Eliminate columns that contain NA values, are empty, or are apparently non-informative.

library(dplyr)
#drop columns containing any NA values
trainSet <- trainSet[,!apply(trainSet, 
                             2, 
                             function(variable) 
                               any(is.na(variable)))]
#drop columns containing any empty strings
trainSet <- trainSet[,!apply(trainSet, 
                             2, 
                             function(variable) 
                               any(variable == ""))]
#drop the row index and redundant bookkeeping columns
trainSet <- select(.data = trainSet, -c("X",
                                        "cvtd_timestamp",
                                        "new_window"))

Data on how well the exercises were performed are stored in the “classe” variable.

Importantly, the assignment states specifically that one may use any of the variables to predict with.

Since each of the 6 participants performed a set of 10 repetitions of each exercise variant, one effective way to classify the variants is to use the accelerometer timestamp together with the name of each participant as predictors. The following plot illustrates this point:

library(plotly)
plot_ly(x = trainSet$raw_timestamp_part_1,
        y = trainSet$classe,
        z = trainSet$user_name,
        type = "scatter3d",
        mode = "markers")

Since the data clearly follow a non-linear pattern, and since there are only two predictors, a decision tree is a natural way to split the prediction space. Therefore, a classification tree was trained with the rpart package.
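As a minimal, self-contained illustration of recursive partitioning with rpart (shown here on R's built-in iris data rather than the assignment data):

```r
library(rpart)

# Fit a small classification tree: rpart recursively splits the
# predictor space into rectangles that separate the classes
fit <- rpart(Species ~ Petal.Length + Petal.Width,
             data = iris, method = "class")

# Resubstitution accuracy on the training data
preds <- predict(fit, iris, type = "class")
acc <- mean(preds == iris$Species)
```

Each split corresponds to an axis-aligned cut of the predictor space, which is why a tree can capture non-linear class boundaries without any transformation of the inputs.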

library(caret)
#split training set into 80% training data and 20% validation data
set.seed(2020)
validSamples <- createFolds(trainSet$classe,
                            k = 5,
                            list = TRUE)[[1]]
validSet <- trainSet[validSamples,]
trainSet <- trainSet[-validSamples,]
dim(validSet); dim(trainSet)
## [1] 3925   57
## [1] 15697    57
#set training parameters to 5-fold cross-validation but otherwise default
tc <- trainControl(method = "cv",
                   number = 5)
#actual training
(train.rpart <- train(classe ~ user_name + raw_timestamp_part_1, 
                      data = trainSet, 
                      method = "rpart",
                      trControl = tc))
## CART 
## 
## 15697 samples
##     2 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12557, 12560, 12556, 12556, 12559 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.9929286  0.9910557
##   0.01028220  0.9930560  0.9912172
##   0.02056441  0.7097998  0.5950054
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0102822.

At the selected pruning hyperparameter cp, cross-validated accuracy and kappa were very high, above 99%. Since these resampling estimates could still be optimistic, out-of-sample accuracy was also estimated on the held-out validation data:

confusionMatrix(validSet$classe,
                predict(train.rpart, newdata = validSet))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1109    7    0    0    0
##          B    4  753    3    0    0
##          C    0    3  673    8    0
##          D    0    0    1  638    4
##          E    0    0    0    0  722
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9891, 0.9948)
##     No Information Rate : 0.2836          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9869   0.9941   0.9876   0.9945
## Specificity            0.9975   0.9978   0.9966   0.9985   1.0000
## Pos Pred Value         0.9937   0.9908   0.9839   0.9922   1.0000
## Neg Pred Value         0.9986   0.9968   0.9988   0.9976   0.9988
## Prevalence             0.2836   0.1944   0.1725   0.1646   0.1850
## Detection Rate         0.2825   0.1918   0.1715   0.1625   0.1839
## Detection Prevalence   0.2843   0.1936   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9970   0.9923   0.9954   0.9930   0.9972

The estimate for out-of-sample accuracy was also very high: 99.2% (CI = 98.9%-99.5%). Kappa was 99.0%.

These results provide evidence that the model is highly predictive for out-of-sample observations. These are the predictions for the test data:

predict(train.rpart, newdata = testSet)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Prediction using actual positional data

For those who would rather not use time to predict the exercise variant, the positional data from the accelerometers can be used instead.

In this case, since there are many predictors, and since they probably do not vary with “classe” in a linear fashion, it makes sense to consider a more complex model that also guards against the overfitting tendency of single decision trees. Therefore, a random forest was used, which is essentially an aggregation of decision trees that reduces variance without substantially increasing bias.
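To make the variance-reduction idea concrete, here is a minimal bagging sketch with rpart on R's built-in iris data. This is a simplification of a random forest: it bootstraps the rows but does not additionally sample a random subset of predictors at each split.

```r
library(rpart)

set.seed(1)
n_trees <- 25
n <- nrow(iris)

# Bagging: fit each tree on a bootstrap resample of the rows
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Aggregate the individual (high-variance) trees by majority vote
votes <- sapply(trees, function(tree)
  as.character(predict(tree, newdata = iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
acc <- mean(bagged == iris$Species)
```

Because each tree sees a slightly different resample, their individual errors partially cancel in the vote, which is the variance reduction that bagging (and the random forest built on top of it) provides.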

#drop the user and time variables so that only positional data remain
trainSet <- select(trainSet, -c("user_name",
                                "raw_timestamp_part_1",
                                "raw_timestamp_part_2",
                                "num_window"))
dim(trainSet)
## [1] 15697    53
#train rf model using the same choices as above but on a parallel backend
#(allowParallel is a trainControl option and defaults to TRUE)
library(doParallel); registerDoParallel(detectCores())
if(file.exists("train.rf.rds")){
  train.rf <- readRDS("train.rf.rds")
  } else {
    (train.rf <- train(classe ~ ., 
                       data = trainSet, 
                       method = "rf",
                       trControl = tc))
    saveRDS(train.rf, "train.rf.rds")
    }
#evaluate out-of-sample performance
confusionMatrix(validSet$classe,
                predict(train.rf,
                        newdata = validSet))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    1  758    1    0    0
##          C    0    5  679    0    0
##          D    0    0    5  637    1
##          E    0    0    0    1  721
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9964        
##                  95% CI : (0.994, 0.998)
##     No Information Rate : 0.2846        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9955        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9934   0.9912   0.9984   0.9986
## Specificity            1.0000   0.9994   0.9985   0.9982   0.9997
## Pos Pred Value         1.0000   0.9974   0.9927   0.9907   0.9986
## Neg Pred Value         0.9996   0.9984   0.9981   0.9997   0.9997
## Prevalence             0.2846   0.1944   0.1745   0.1625   0.1839
## Detection Rate         0.2843   0.1931   0.1730   0.1623   0.1837
## Detection Prevalence   0.2843   0.1936   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9996   0.9964   0.9948   0.9983   0.9992

Both the cross-validated and out-of-sample accuracy estimates are very high, with accuracy and kappa above 99%. These are the predictions for the test data, this time using the model that does not take time into consideration.

predict(train.rf, newdata = testSet)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
saveRDS(predict(train.rf, newdata = testSet), "predict.rds")